Michael Clark
Statistician Lead
Consulting for Statistics, Computing and Analytics Research
Advanced Research Computing
The following provides a brief introduction to generalized additive models and some thoughts on getting started within the R environment. It doesn’t assume much more than a basic exposure to regression, and maybe a general idea of R though not necessarily any particular expertise. The presentation is of a very applied nature, and such that the topics build upon the familiar and generalize to the less so, with the hope that one can bring the concepts they are comfortable with to the new material. The audience in mind is a researcher with typical social science training, though is not specific to those disciplines alone.
As this document is more conceptual, a basic familiarity with R is all that is needed to follow the code, though there is much to be gained from simple web browsing on R if one needs it. And while it wasn’t the intention starting out, this document could be seen as a vignette for the mgcv package, which is highly recommended.
This document was created within Rstudio and rmarkdown. Last modified 2016-05-09. Original draft August, 2012.
Color guide:
R Info: R version 3.3.0 (2016-05-03) Supposedly Educational
Let us begin by considering the standard linear regression model (SLiM) estimated via ordinary least squares (OLS). We have some response/variable we wish to study, and believe it to be some function of other variables. In the margin to the right, \(y\) is the variable of interest, \[y\sim \mathcal{N}(\mu,\sigma^{2})\]
\(\mu = b_{0}+b_{1}X_{1}+b_{2}X_{2}...+b_{p}X_{p}\) assumed to be normally distributed with mean \(\mu\) and variance \(\sigma^2\), and the Xs are the predictor variables/covariates in this scenario. The predictors are multiplied by the coefficients (beta) and summed, giving us the linear predictor, which in this case also directly provides us the estimated fitted values.
And since we’ll be using R and might as well get into that mindset, I’ll give an example of how the R code might look like.
mymod = lm(y ~ x1 + x2, data=mydata)One of the issues with this model is that, in its basic form it can be very limiting in its assumptions about the data generating process for the variable we wish to study. Also, while the above is one variant of the usual of the manner in which it is presented in introductory texts, it also very typically will not capture what is going on in a great many data situations.
In that light, we may consider the generalized linear model. Generalized linear models incorporate other types of distributionsOf the exponential family., and include a link function \(g(.)\) relating the mean \(\mu\), or stated differently, the estimated fitted values \(E(y)\), to the linear predictor \(X\beta\), often denoted \(\eta\). The general form is thus \(g(\mu) = X\beta\).
\[g(\mu) = \eta = X\beta \\ E(y) = \mu = g^{\tiny-1}(\eta)\]
Consider again the typical linear regression. We assume a Gaussian (i.e. Normal) distribution for the response, we assume equal variance for all observations, and that there is a direct link of the linear predictor and the expected value \(\mu\), i.e. \(X\beta = \mu\). In fact the typical linear regression model is a generalized linear model with a Gaussian distribution and identity link function.
To further illustrate the generalization, we consider a distribution other than the Gaussian. In the example to the right, we examine a Poisson distribution for some vector of count data.
\[y \sim \mathcal{P}(\mu) \\
g(\mu) = b_{0}+b_{1}X_{1}+b_{2}X_{2}...+b_{p}X_{p}\]
There is only one parameter to be considered, \(\mu\), since for the Poisson the mean and variance are equal. For the Poisson, the (canonical) link function \(g(.)\), is the natural log, and so relates the log of \(\mu\) to the linear predictor. As such we could also write it in terms of exponentiating the right hand side. \[\mu = e^{b_{0}+b_{1}X_{1}+b_{2}X_{2}...+b_{p}X_{p}}\]
While there is a great deal to further explore with regard to generalized linear models, the point here is to simply to note them as a generalization of the typical linear model with which all would be familiar after an introductory course in statistics. As we eventually move to generalized additive models, we can see them as a subsequent step in the generalization.
Now let us make another generalization to incorporate nonlinear forms of the predictors. The form \[ y \sim ExpoFam(\mu, etc.) \\ \mu = E(y) \\ g(\mu) = b{_0} + f(x_1) + f(x_2) +...+f(x_p)\] at the right gives the new setup relating our new, now nonlinear predictor to the expected value, with whatever link function may be appropriate.
So what’s the difference? In short, we are using smooth functions of our predictor variables, which can take on a great many forms, with more detail in the following section. For now, it is enough to note the observed values \(y\) are assumed to be of some exponential family distribution, and \(\mu\) is still related to the model predictors via a link function. The key difference now is that the linear predictor now incorporates smooth functions \(f(x)\) of at least some (possibly all) covariates.
In many data situations the relationship between some covariate(s) \(X\) and response \(y\) is not so straightforward. Consider the plot at the right. Attempting to fit a standard OLS regression results in the blue line, which doesn’t capture the relationship very well.
\[ y \sim \mathcal{N}(\mu,\sigma^{2})\\ \mu = b_{0}+b_{1}X_{1}+b_{2}X^2 \] One common approach we could undertake is to add a transformation of the predictor \(X\), and in this case we might consider a quadratic term such that our model looks something like that to the right.
We haven’t really moved from the general linear model in this case, but we have a much better fit to the data as evidenced by the graph.
There are still other possible routes we could take. Many are probably familiar with loess (or lowess) regression, if only in the sense that often statistical packages may, either by default or with relative ease, add a nonparametric fit line to a scatterplotSee the scatterplots in the car package for example.. By default this is typically a lowess fit.
Take a look at the following figure. For every (sorted) x\(_{0}\) value, we choose a neighborhood around it and, for example, fit a simple regression. As an example, let’s look at x\(_{0}\) = 3.0, and choose a neighborhood of 100 X values below and 100 values above.
Now, for just that range we fit a linear regression model and note the fitted value where \(x_{0}=3.0\).
If we now do this for every X value, we’ll have fitted values based on a rolling neighborhood that is relative to each X value being considered. Other options we could throw into this process, and usually do, would be to fit a polynomial regression for each neighborhood, and weight observations the closer they are to the X value, with less weight given to more distant observations.
The next plot shows the result from such a fitting process, specifically lowess, or locally weighted scatterplot smoothing. For comparison the regular regression fit is also provided. Even without using a lowess approach, we could have fit have fit a model assuming a quadratic relationship, \(y = x + x^2\), and it would result in a far better fit than the simpler model\(y\) was in fact constructed as \(x + x^2 +\) noise.. While polynomial regression might suffice in some cases, the issue is that nonlinear relationships are generally not specified so easilyBolker 2008 notes an example of fitting a polynomial to 1970s U.S. Census data that would have predicted a population crash to zero by 2015.. One can see this a variety of approaches in the next figure, which is a version of that found in Venables and Ripley (2002).
The next figure regards a data set giving a series of measurements of head acceleration in a simulated motorcycle accident. Time is in milliseconds, acceleration in g. Here we have data that are probably not going to be captured with simple transformations of the predictors. We can compare various approaches, and the first is a straightforward cubic polynomial regression, which unfortunately doesn’t help us much. We could try higher orders, which would help, but in the end we will need something more flexible, and generalized additive models can help us in such cases.
Let’s summarize our efforts up to this point.
\[y\sim \mathcal{N}(\mu, \sigma^{2})\] \[\mu = b_{0}+b_{1}X_{1}\]
\[y\sim \mathcal{N}(\mu, \sigma^{2})\] \[\mu = b_{0}+b_{1}X_{1}+b_{2}X^2\]
\[y\sim \mathcal{N}(\mu, \sigma^{2})\] \[g(\mu) = b_{0}+b_{1}X_{1}+b_{2}X^2\]
\[y\sim \mathcal{N}(\mu, \sigma^{2})\] \[g(\mu) = f(X)\]
Now that some of the mystery will hopefully have been removed, let’s take a look at GAMs in action.
Venables, William N., and Brian D. Ripley. 2002. Modern Applied Statistics with S. Birkhäuser.